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A Parallel Implemetation of Multilevel Recursive Spectral 
Bisection for Application to Adaptive Unstructured Meshes 

Stephen T. Barnard' Horst Simon* 


Abstract 

The design of a parallel implementation of multilevel recursive spectral bisection 
is described. The goal is to implement a code that is fast enough to enable dynamic 
repartitioning of adaptive meshes. 


1 Background 

The Recursive Spectral Bisection (RSB) algorithm is one of a class of recursive bisection 
methods for partitioning unstructured problems [4]. RSB is typically used as a preprocess- 
ing step prior to running a unstructured-mesh simulation on a massively parallel computer. 
Applications that change the mesh adaptively throughout the simulation, however, require 
a fast method to repartition “on the fly.” 

Finding a partition that both balances the work of all processors and minimizes 
interprocessor communication is an NP-hard problem. Therefore, all practical partitioning 
algorithms are necessarily heuristic approximations. Among these. RSB empirically 
provides the best partitions for a large set of problems, albeit at somewhat more run 
time than most other methods. RSB bisects a graph by first finding the eigenvector (the 
Fiedler vector) corresponding to the smallest non-trivial eigenvalue of the Laplacian matrix 
of a graph. The graph could represent, for example, the connectivity between elements in a 
finite-element mesh, or the connectivity between volumes in an unstructured finite- volume 
mesh. The vertices of the graph are reordered with the permutation induced by sorting the 
components of the Fiedler vector, and the graph is cut in half, with those vertices in the 
lower half of the new ordering in one part, and the remaining vertices in the other part. 

The most straightforward implementation of RSB, which uses the Lanczos algorithm to 
find the Fiedler vectors, is unacceptably slow for many applications. Multilevel recursive 
spectral bisection (MRSB) [1] is a refinement of the algorithm that is much faster, typically 
by an order of magnitude or more, and has been instrumental in the acceptance of RSB. 

The basic idea behind MRSB is to speed up the Fiedler- vector computation by constructing 
a series of successively smaller contracted graphs that maintain the global structure of the 
original graph. The Fiedler vector of the smallest graph is found quickly with the Lanczos 
algorithm. That result is interpolated to the next larger graph to form an approximate 
Fiedler vector, which is then refined with Rayleigh Quotient Iteration using the SYMMLQ 
algorithm. This process is repeated until the Fiedler vector of the original graph is obtained. 
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/* Partition matrix M into n parts. Assume n is a power of 2. */ 
void part it ion (struct matrix *M t int n) 

{ struct matrix 
if (n > 1) { 

bisect (M) ; /* set mask to {0,1} */ 

MO s copy_submaT:rix(M,Q) ; /* partition M */ 

Mi = copy^submatrixCM , i) ; 

part it ion (MO ,n/2) ; /* partition each submatrix */ 

partition(Mi ,n/2) ; 

remap (MO ,M1 ,n/2, M) ; /* remap processor assignments */ 

f ree.matrix(MO) ; /* free the submatricas */ 

free^matrix(Mi) ; 

» 

Fig. 1. partition, 

2 A Parallel Design for MRSB 

The MRSB algorithm is naturally recursive in two ways. First of ail it is a recursive 
bisection method; and secondly, at a lower level in the control structure. it uses a recursive 
multilevel technique to find Fiedler vectors, and hence to determine each bisection. These 
two quite different uses of recursion present different challenges (and opportunities) for 
parallel implementation. 

2.1 Recursive Bisection 

Recursive- bisection partitioning algorithms have the structure illustrated by partition, 
the C code in Figure 1. The partition routine takes as its input a sparse matrix M and 
assigns rows of M to n processors. The bisect routine simply assigns rows to elements of 
the set (0. 1}, while the remap routine promotes processor assignments for the submatrices 
to processor assignments tor the parent matrix. As it stands, part it icn works for any 
bisection algorithm. 

An efficient way to implement partition on a distributed-memory MIMD parallel 
computer is illustrated in Figure 2. Initially, a “task-team n of processors is formed, each cf 
which holds some number of rows of the matrix. .All processors participate in determining 
the first bisection, then they split into two smaller task-teams. These task teams are 
responsible for partitioning the two submatrices determined by the first bisection after the 
appropriate data has been copied. Recursive splitting proceeds down to task teams of single 
processors. Task teams are asynchronous: after a task team is split its two children can 
execute independently because ail barriers and other global operations are restricted to a 
particular task team. 

2.2 Recursive Multilevel Fiedler- Vector Computation 

MRSB bisects by sorting the Fiedler vector, which is computed recursively as illustrated by 
the C code in Figure 3. If the matrix is small the Fiedler vector is found with the Lanczos 
algorithm; otherwise, the matrix is contracted, the Fiedler vector of the smaller matrix is 
found recursively, the result is interpolated to the original matrix and improved with the 
Rayleigh Quotient Iteration algorithm. Unlike partition, the recursive computation of 
f iedler occurs over a single task team, as shown in Figure 4. 
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/* Find tie Fiedler vector ox the sparse Laplacian matrix M */ 


void fiedler (struct matrix *M) 
{ struct matrix *M_c; 

if (M->neqns <= min_neqns) i 
lanczos(M) ; 
else 

M_c = contract (M) ; 
f iedler (M.c) ; 
interpolate(>Lc,M) ; 
rqi(M) ; 

} 

> 


/* M is small enough */ 

/* use Lanczos */ 

/* M is too big */ 

/* contract matrix M */ 

/* find the Fiedler vector */ 
/* interpolate to M */ 

/* refine estimate vith RQI */ 


Fig. 3. Find Fiedler Vector , 
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FiG. 4. Recursive Multilevel Fiedler Computation 


We now examine the feasibility of implementing a parallel version of eacn step of this 
procedure: 

• The contraction step builds a smaller graph by selecting a maximal independent set 
of vertices and then connecting them with a breadth-first search through the original 
graph, (Two vertices in the maximal- independent set are connected in the contracted 
graph if and only if there is a shortest path in the original graph that does not contain 
a vertex in the maximal independent set.) 

- Luby [2j has shown how to find a maximal independent set efficiently in parallel. 
(Finding a maximal independent set was for a time thought to oe difficult to do 
in parallel, although the “greedy*' 1 serial algorithm is trivial.) 

- The breadth-first search (to connect members of the maximal independent set) 
can be implemented with independent processes for every vertex, simply using 
a low-level “add-haif-edge 71 routine that adds a single off-diagonal entry to the 
Laplacian matrix. The basic idea is to grow neighborhoods from each member of 
the maximal independent set, adding “half edges” when neighborhoods intersect, 
and terminating when ail vertices have been expanded. 

• The interpolation step (a simple prolongation operator) is easy, especially on a shared 
memory system. 

• RQI (Pcayleigh Quotient Iteration) is easy to implement in parallel because it only 
depends on BLAS1 and BLAS2 operations. 

• The Lanczos algorithm is more difficult, though not impossible, to parallelize, but this 
step requires only a very small amount of work because the final contracted graph 
is tiny. Mapping the Lanczos computation to one processor is a workable (if ugly) 
solution. .Another possibility we intend to investigate is to use Davidson's method, 
which Is easily parallelized. 


ini i Ji iJi.il. ,, mu min i mill 




SIAM Proceedings Series Macros 


o 


2.3 Other implementation issues. 

The new C implementation of MRSB takes an object-oriented approach. The code is 
areally simplified by using a C structure to encapsulate the data structures necessary 
for representing a graph and its associated Laplacian. matrix. The cheat supplies the 
^tial graph and optionally an array of vertex weights, then MRSB finds a mapping from 
vertices to processors, using a number of internal buffers not of interest to the client. The 
adjacency structures are represented as linked lists, which provide an efficient ana hex: e 
data structure for managing the unbounded and somewhat unpredictaoie requirements oi 

the contraction operation. . 

There are two additional operations thac must be parallelized that we haven . oiscussea. 

sorting and finding connected components. Several parallel algorithms are possible. In any 
case, these operations require such a small proportion of the total work that efficiency is 

not a serious concern. 


2.4 Shared Memory vs. Message Passing Implementations 
Unlike structured-grid codes, unstructured MPP applications must support highly irregular 
oatterns of communication. Every processor holds pointers not only to local data, but a*o 
io data on a variable number of neighboring processors. The essential problem « uO* 
to dereference these remote pointers efficiently, with no redundant communication. On a 
shared-memory system such as the T3D this is scarcely more difficult than dereferencing 
pointers to local data: if a reference is remote the data is merely read from the remote 
memory, with no need for the processor that •‘owns' 7 the data to do anything. - message- 
passing architecture, however, requires an elaborate preprocessing s.ep to sec up 
structures that can bd used for remote scatter/ gather operations. _ . 

Soarse matrix-vector multiplication illustrates this problem. A reasonably emciem 
shared-memory version is only slightly more complex than a serial version. The only way 
to write a sparse matrix-vector multiply of roughly the same code complexity on . message- 
passing svstem. however, requires global reductions of complexity O(niogm), wnere n is 
the number of rows and m is the number of processors. This leads to poor scaling as shown 
in Figure 5. To be sure, this is not the most efficient way to do matrix-vector multiplication 
with message-oassing: the point is that the only simple way is poor. 

We have implemented a “logarithmically parallel” version of MRSB. Suppose we have 
n processors, where n is a power of two and the processors are numbered {0, 1, ...» If- 
Frst coov the graph to all processors. The first bisection is computed by processor 0 using 
the serial multilevel code. The results of this bisection are sent to processor n/2 and then 
processor 0 and processor n/2 compute the bisections of the two new subgraphs. This is 

repeated recursively until the graph is fuily partitioned. _ , • t - 

This simple method is inefficient because the the computation of the first bisection 
involves only one processor, the next bisection involves only two, and so on. Nevertheless, 
the results are encouraging. The table shows the run times in seconds to partition a 
oraph with 15606 vertices and 45878 edges into 4 parts (npart) and 128 parts, lne 
“Multilevel Serial” times were obtained on a Silicon Graphics Indigo 2 workstation, .he 
“Multilevel T3D 77 times were obtained on a Cray T3D using the logant c y par e* 
method described above, and the “RSB. CM5” time was obtained on a 128-vector-umt CM5 

using the CMSSL routine (parallel single-level RSB). 

The multilevel algorithm on a single workstation outperforms the 128-processor CMo, 
demonstrating the effectiveness of the multilevel approach. The processors of the T3D 
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Fig. 5. Performance of Matrix- Vector Multiplication 


Table 1 

RSB Partitioning Run Times" 



have approximately 
processor T3D is on. 
aeariy free, however. 


;he same performance as the Indigo '2 processor (150 MHz), so the 4- 
slightly faster. Computing higher order partitionings on the T3D is 
because the parallelism rapidly becomes more effective as the number 


of partitions increases. 


3 Conclusions 

An efficient parallel implementation of MRSB to support repartitioning adaptive meshes 
is feasible. Recursive bisection in general can be implemented with recursive asynchronous 
task teams. Coding the recursive multilevel technique for finding Fiedler vectors, including 
the difficult contraction operation, is facilitated by the shared- memory architecture of 
the T3D. Preliminary testing suggests a substantial improvement in speed over currently 
available serial and parallel spectral partitioned. 
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